Finding the Number of Clusters in Unlabeled Datasets using Extended Dark Block Extraction
Authors
Abstract
Cluster analysis is the problem of partitioning a set of objects O = {o1, ..., on} into c self-similar subsets based on available data. In general, clustering unlabeled data poses three major problems: 1) assessing cluster tendency, i.e., how many clusters to seek; 2) partitioning the data into c meaningful groups; and 3) validating the c clusters that are discovered. We address the first problem: determining the number of clusters c prior to clustering. Many clustering algorithms require the number of clusters as an input parameter, so the quality of the resulting clusters depends heavily on this value. Most existing methods are post-clustering measures of cluster validity, i.e., they attempt to choose the best partition from a set of alternative partitions. In contrast, tendency assessment attempts to estimate c before clustering occurs. Here, we represent the structure of an unlabeled data set as a Reordered Dissimilarity Image (RDI), in which the pairwise dissimilarity information about a data set of n objects is displayed as an n × n image. The RDI is generated using VAT (Visual Assessment of Cluster Tendency) and highlights potential clusters as a set of "dark blocks" along the diagonal of the image, so the number of clusters can be estimated by counting the dark blocks along the diagonal. We develop a new method called "Extended Dark Block Extraction" (EDBE) for counting the dark blocks along the diagonal of the RDI. The EDBE method combines several image and signal processing techniques.
General Terms: Data Mining, Image Processing, Artificial Intelligence.
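To make the RDI idea concrete, the following is a minimal sketch of the standard VAT reordering, which the abstract names as the source of the RDI. It is not the EDBE method itself, whose image and signal processing steps are not spelled out here. The use of Euclidean distance, the function names `vat_order` and `reordered_dissimilarity_image`, and the synthetic two-cluster data are all assumptions made for illustration.

```python
import numpy as np

def vat_order(R):
    """Return a VAT ordering of a symmetric dissimilarity matrix R.

    Follows the usual Prim-like scheme: start at one endpoint of the
    largest dissimilarity, then repeatedly append the unselected object
    that is closest to any already selected object.
    """
    R = np.asarray(R, dtype=float)
    n = R.shape[0]
    start = int(np.unravel_index(np.argmax(R), R.shape)[0])
    order = [start]
    remaining = np.ones(n, dtype=bool)
    remaining[start] = False
    # min_dist[j] = smallest dissimilarity between j and the selected set
    min_dist = R[start].copy()
    for _ in range(n - 1):
        candidates = np.where(remaining)[0]
        nxt = int(candidates[np.argmin(min_dist[candidates])])
        order.append(nxt)
        remaining[nxt] = False
        min_dist = np.minimum(min_dist, R[nxt])
    return np.array(order)

def reordered_dissimilarity_image(X):
    """Build the n x n RDI for data X (assumption: Euclidean dissimilarity)."""
    diff = X[:, None, :] - X[None, :, :]
    R = np.sqrt((diff ** 2).sum(axis=-1))
    order = vat_order(R)
    return R[np.ix_(order, order)], order

# Hypothetical data: two well-separated groups, presented in shuffled order.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)), rng.normal(5.0, 0.1, (30, 2))])
labels = np.array([0] * 20 + [1] * 30)
perm = rng.permutation(50)
X, labels = X[perm], labels[perm]

rdi, order = reordered_dissimilarity_image(X)
```

Displayed as a grayscale image (for example with matplotlib's `imshow`), `rdi` would be expected to show two dark blocks along the diagonal for this data, one per group. Counting such blocks automatically is the part that EDBE addresses.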
Similar resources
A Comparative study of Clustering in Unlabelled Datasets Using Extended Dark Block Extraction and Extended Cluster Count Extraction
One of the major problems in cluster analysis is the determination of the number of clusters in unlabeled data prior to clustering. In this paper, we implement a new method for determining the number of clusters called Extended Dark Block Extraction (EDBE), which is based on an existing algorithm for Visual Assessment of Cluster Tendency (VAT) of a data set. Its basic steps include 1) Generatin...
Finding the Number of Clusters in Unlabelled Datasets Using Extended Cluster Count Extraction (ECCE)
Clustering analysis is the task of partitioning a set of objects O = {o1, ..., on} into c self-similar subsets based on available data. In general, clustering of unlabeled data poses three major problems: 1) assessing cluster tendency, i.e., how many clusters to seek; 2) partitioning the data into c meaningful groups; and 3) validating the c clusters that are discovered. All clustering algorithms ...
Enhanced Dark Block Extraction Method Performed Automatically to Determine the Number of Clusters in Unlabeled Data Sets
Abstract: One of the major issues in cluster analysis is deciding the number of clusters or groups in a set of unlabeled data. In addition, the presentation of the clusters should be analyzed to assess the accuracy of the clustered objects. This paper proposes a new method called Enhanced-Dark Block Extraction (E-DBE), which automatically identifies the number of object groups in unlabeled dat...
Assessment of the Performance of Clustering Algorithms in the Extraction of Similar Trajectories
In recent years, the rapid growth of spatial trajectory data and the need to process it and extract useful information and meaningful patterns have attracted many researchers to the field of spatio-temporal trajectory clustering. The processing and analysis of these trajectories have resulted in the extraction of useful information whic...
Automatic Data Clustering Using an Improved Imperialist Competitive Algorithm
The Imperialist Competitive Algorithm (ICA) is considered a prime meta-heuristic algorithm for finding the general optimal solution in optimization problems. This paper presents the use of ICA for automatic clustering of huge unlabeled data sets. By using a proper structure for each of the chromosomes and the ICA, at run time, the suggested method (ACICA) finds the optimum number of clusters while optim...